This report presents our team’s findings for the SFO team
using the SFO_survey_withText.txt dataset, which contains responses to
the San Francisco International Airport 2010 Customer Survey. We
describe our methods, present relevant EDA with supporting visuals,
explain how we addressed each question, report and interpret our
results, and draw conclusions. Before conducting any analysis, we
performed EDA and data cleaning on the dataset.
We performed light EDA by examining the percentage of missing values within the dataset. We found that 38 variables were missing more than 70% of their data, and two additional variables were missing roughly 53% and 20%, respectively. The remaining 61 variables were each missing less than 9% of their data.
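This screening step can be sketched as follows (illustrative Python, not the original analysis code; the toy columns and values are made up, while the 70% cutoff mirrors the threshold described above):

```python
# Illustrative sketch of the missing-value screen described above.
# Toy data: each column is a list of responses; None marks a missing value.
toy = {
    "Q19_income":  [1, None, 3, None, 2, 4, None, 2, 1, 3],   # 30% missing
    "Q6A_artwork": [5, 4, 4, 3, 5, 2, 4, 5, 3, None],         # 10% missing
    "Q_comments":  [None] * 8 + ["ok", "fine"],               # 80% missing
}

def pct_missing(values):
    """Percentage of entries that are missing (None)."""
    return 100.0 * sum(v is None for v in values) / len(values)

# Flag columns for removal when more than 70% of the data is missing.
drop = [col for col, vals in toy.items() if pct_missing(vals) > 70]
keep = [col for col in toy if col not in drop]
```

On this toy data only `Q_comments` crosses the 70% threshold, so it alone is dropped.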
We retained variables with less than 9% missing data in the final
dataset, along with Q19, which has 20% missing data. Q19 records the
customer’s household income, which we later need in order to examine
whether income influences customers’ satisfaction with SFO as a
whole.
During data preprocessing, all variables with more than 10% missing
data were removed from the final dataset, with the exception of Q19.
Rather than dropping rows with NA values, we replaced each remaining
variable’s missing entries with its column mean, separately for each
section of Part A.
We also removed the 31 customers whose response was blank, “not applicable”, “refused”, or otherwise unspecified. Note that 29 of those individuals had given “not applicable” as their rating response.
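The column-mean imputation described above can be sketched as follows (illustrative Python, not the original R code):

```python
# Illustrative column-mean imputation: each NA (None) in a retained
# variable is replaced by the mean of that variable's observed values.
def impute_mean(column):
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

filled = impute_mean([1, None, 3, None, 5])  # observed mean is 3.0
```

Mean imputation keeps the sample size intact at the cost of shrinking the imputed variable’s variance, which is acceptable here given how little data each retained variable is missing.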
In Part A, the SFO team posed three specific questions for us to investigate.
For A.1., customers were asked to rate the “SFO Airport as a whole” on
a scale from 1 (“unacceptable”) to 5 (“outstanding”). The executives
want to know whether there are patterns across satisfied and
dissatisfied customers based on demographic characteristics such as
sex, age group, and income level.
We began with further EDA, running univariate and multivariate analyses to examine the variables’ distributions and their relationships to one another before conducting Latent Class Analysis (LCA). For better visualization, we converted the variables used in this analysis to factors: gender, age group, household income level, and the customer’s rating of the SFO airport.
For our univariate analysis, we examined the distribution of each
variable independently using barplots. The data are heavily weighted
toward customer ratings of an Above-Average experience at the SFO
airport. As for the demographic characteristics of the customer base,
the data contain 11% more males than females, with the peak age group
between 25 and 34 and peak household income from $50,000 to
$100,000.
Because relatively few responses rated the airport Unacceptable or
Below-Average, we defined customers who rated Average or below as
dissatisfied, while those who rated Above-Average or higher were
classified as satisfied.
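The recoding into the binary classifier can be sketched as follows (illustrative Python, not the original R code):

```python
# Illustrative recoding of the 1-5 rating into the binary classifier:
# Average (3) and below -> "Dissatisfied";
# Above-Average (4) and Outstanding (5) -> "Satisfied".
def binary_rating(rating):
    return "Dissatisfied" if rating <= 3 else "Satisfied"

labels = [binary_rating(r) for r in [1, 2, 3, 4, 5]]
```

Collapsing the sparse low categories this way trades rating granularity for better-populated cells in the downstream cross-tabulations and LCA.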
For our bivariate analysis, we examined the relationship between the
binary Satisfied/Dissatisfied rating and the demographic
characteristics using a table and barplots. Please note that the
numbers on each bar are labeled inversely: the bottom number
corresponds to the top region, representing dissatisfied customers,
and the top number corresponds to the bottom region, representing
satisfied customers.
## # A tibble: 103 × 5
## # Groups: BinaryRating, Gender_fa, AgeGroup_fa [27]
## BinaryRating Gender_fa AgeGroup_fa IncomeLevel_fa n
## <fct> <fct> <fct> <fct> <int>
## 1 Dissatisfied Male 18-24 < 50k 8
## 2 Dissatisfied Male 18-24 50k-100k 10
## 3 Dissatisfied Male 18-24 100k-150k 4
## 4 Dissatisfied Male 18-24 150k < 1
## 5 Dissatisfied Male 25-34 < 50k 26
## 6 Dissatisfied Male 25-34 50k-100k 63
## 7 Dissatisfied Male 25-34 100k-150k 15
## 8 Dissatisfied Male 25-34 150k < 15
## 9 Dissatisfied Male 35-44 < 50k 5
## 10 Dissatisfied Male 35-44 50k-100k 63
## # ℹ 93 more rows
Based on the results, the trend among satisfied and dissatisfied
customers is the same within each demographic subgroup, with the
exception of household income level. Although both groups are mostly
represented by household incomes of $50,000 to $100,000, the
second-largest group among satisfied customers is those earning under
$50,000, while among dissatisfied customers it is those earning over
$150,000. This suggests that customer ratings may be influenced more
by household income than by gender or age group.
The table also reveals substantial clustering across these variables, suggesting they are well suited to Latent Class Analysis, which can provide a cleaner breakdown of the potential class groups that appear to be forming above.
For our model analysis, we chose LCA because it is well suited to discrete categorical data, grouping respondents by their unique response propensities. We fit four LCA models with 3 to 6 latent classes based on income level, age group, gender, and the customer’s rating of SFO as a whole. In addition, since we want to determine whether these characteristics influence not just each level of the customer rating but also whether a customer is satisfied at all, we fit another set of four models using the binary classifier defined earlier.
Since there is no tool that automatically diagnoses the best number of classes, we fit several models and performed class selection by comparing model performance using the Akaike Information Criterion (AIC).
## [,1]
## class3_dfa1 30130.13
## class4_dfa1 30096.33
## class5_dfa1 30086.33
## class6_dfa1 30097.74
## [,1]
## class3_binary_dfa1 27023.12
## class4_binary_dfa1 27001.69
## class5_binary_dfa1 27009.31
## class6_binary_dfa1 27008.10
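AIC is 2k − 2·log L, where k is the number of estimated parameters. As a quick sanity check (illustrative Python, not the analysis code), plugging in the 5-class model’s reported log-likelihood (−14969.17) and parameter count (74) reproduces the printed AIC(5) up to rounding, and taking the minimum over the table above recovers the selected class count:

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2*logL (lower is better)."""
    return 2 * n_params - 2 * log_likelihood

# Reported fit for the 5-class model on the full rating scale.
aic5 = aic(-14969.17, 74)          # ~30086.34, matching AIC(5) printed below

# Model selection: pick the class count with the smallest AIC.
aics = {3: 30130.13, 4: 30096.33, 5: 30086.33, 6: 30097.74}
best_k = min(aics, key=aics.get)   # 5
```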
As a result, the LCA models with 5 and 4 latent classes performed best for the customer ratings and the binary classifier, respectively. We then examined each model’s fit along with its conditional probabilities to interpret the classes defined by the two models.
## Conditional item response (column) probabilities,
## by outcome variable, for each class (row)
##
## $IncomeLevel_num
## Pr(1) Pr(2) Pr(3) Pr(4)
## class 1: 0.0000 0.3931 0.2371 0.3698
## class 2: 0.1819 0.7198 0.0983 0.0000
## class 3: 0.6134 0.0000 0.1029 0.2836
## class 4: 0.9931 0.0004 0.0000 0.0065
## class 5: 0.2572 0.5877 0.1263 0.0289
##
## $AgeGroup_num
## Pr(1) Pr(2) Pr(3) Pr(4) Pr(5) Pr(6) Pr(7)
## class 1: 0.0000 0.0192 0.1213 0.2805 0.3058 0.2009 0.0724
## class 2: 0.0079 0.0746 0.4751 0.2546 0.0000 0.0873 0.1005
## class 3: 0.0205 0.0000 0.3771 0.0778 0.2840 0.0787 0.1619
## class 4: 0.0087 0.5251 0.2769 0.0356 0.0954 0.0000 0.0583
## class 5: 0.0119 0.1442 0.0481 0.0000 0.1996 0.3262 0.2700
##
## $Gender_num
## Pr(1) Pr(2)
## class 1: 0.5874 0.4126
## class 2: 0.6126 0.3874
## class 3: 1.0000 0.0000
## class 4: 0.3656 0.6344
## class 5: 0.2948 0.7052
##
## $Rating_num
## Pr(1) Pr(2) Pr(3) Pr(4) Pr(5)
## class 1: 0.0010 0.0172 0.2618 0.6173 0.1027
## class 2: 0.0014 0.0179 0.2523 0.5823 0.1461
## class 3: 0.0140 0.0304 0.1952 0.4125 0.3479
## class 4: 0.0000 0.0000 0.0972 0.7656 0.1372
## class 5: 0.0000 0.0084 0.0931 0.4246 0.4739
##
## Estimated class population shares
## 0.45 0.2847 0.0507 0.069 0.1455
##
## Predicted class memberships (by modal posterior prob.)
## 0.4605 0.3125 0.0421 0.0712 0.1136
##
## =========================================================
## Fit for 5 latent classes:
## =========================================================
## number of observations: 3203
## number of estimated parameters: 74
## residual degrees of freedom: 205
## maximum log-likelihood: -14969.17
##
## AIC(5): 30086.33
## BIC(5): 30535.65
## G^2(5): 185.9829 (Likelihood ratio/deviance statistic)
## X^2(5): 195.8229 (Chi-square goodness of fit)
##
## ALERT: iterations finished, MAXIMUM LIKELIHOOD NOT FOUND
##
As a result, all of the classes lean toward highly satisfied
customers regardless of grouping, which we had presumed based on our
EDA. Class 1 mostly represents male customers, aged 45-54, earning
$50,000-$100,000, who rate Above-Average. Class 2 mostly represents
male customers, aged 25-34, earning $50,000-$100,000, who rate
Above-Average with some Average. Class 3 represents exclusively male
customers, aged 25-34, earning under $50,000, who rate Outstanding.
Class 4 mostly represents female customers, aged 25-34, earning
$50,000-$100,000, whose ratings are roughly split between
Above-Average and Outstanding with some Average. Class 5 mostly
represents female customers, aged 18-24, almost all earning under
$50,000, who rate Above-Average with some Average. In addition,
Class 1 is the most heavily populated class, representing almost 50%
of the data, with Class 2 about half its size.
## Conditional item response (column) probabilities,
## by outcome variable, for each class (row)
##
## $IncomeLevel_num
## Pr(1) Pr(2) Pr(3) Pr(4)
## class 1: 0.0000 0.1384 0.3341 0.5276
## class 2: 0.7321 0.2206 0.0473 0.0000
## class 3: 0.1763 0.6824 0.0925 0.0488
## class 4: 0.0000 0.9295 0.0705 0.0000
##
## $AgeGroup_num
## Pr(1) Pr(2) Pr(3) Pr(4) Pr(5) Pr(6) Pr(7)
## class 1: 0.0030 0.0203 0.1158 0.2532 0.3124 0.2058 0.0895
## class 2: 0.0122 0.3354 0.2530 0.0008 0.1151 0.1216 0.1619
## class 3: 0.0016 0.0473 0.3774 0.3318 0.1102 0.0739 0.0578
## class 4: 0.0126 0.0371 0.1400 0.0506 0.1871 0.3445 0.2280
##
## $Gender_num
## Pr(1) Pr(2)
## class 1: 0.5884 0.4116
## class 2: 0.4413 0.5587
## class 3: 0.6791 0.3209
## class 4: 0.3523 0.6477
##
## $BinaryRating
## Dissatisfied Satisfied
## class 1: 0.2683 0.7317
## class 2: 0.1064 0.8936
## class 3: 0.2910 0.7090
## class 4: 0.1969 0.8031
##
## Estimated class population shares
## 0.3191 0.1736 0.3504 0.1569
##
## Predicted class memberships (by modal posterior prob.)
## 0.3234 0.1711 0.3428 0.1627
##
## =========================================================
## Fit for 4 latent classes:
## =========================================================
## number of observations: 3203
## number of estimated parameters: 47
## residual degrees of freedom: 64
## maximum log-likelihood: -13453.85
##
## AIC(4): 27001.69
## BIC(4): 27287.07
## G^2(4): 87.76119 (Likelihood ratio/deviance statistic)
## X^2(4): 85.32092 (Chi-square goodness of fit)
##
## ALERT: iterations finished, MAXIMUM LIKELIHOOD NOT FOUND
##
As a result, all of the classes lean toward satisfied customers
regardless of grouping, which we had presumed based on our previous
LCA model. Class 1 mostly represents male customers, aged 45-54,
earning over $150,000, who are satisfied. Class 2 mostly represents
male customers, aged 25-34, earning $50,000-$100,000, who are
satisfied. Class 3 mostly represents female customers, aged 55-64,
earning $50,000-$100,000, who are satisfied. Class 4 mostly
represents female customers, aged 18-24, earning under $150,000, who
are satisfied.
Although every class was predominantly composed of satisfied
customers, the conditional probabilities show that dissatisfaction
was more common among males than females, among moderate incomes such
as $50,000-$100,000, and among the 25-34 and 45-54 age groups. In
addition, the classes here are more evenly proportioned, with shares
ranging from roughly 0.17 to 0.35, Class 2 being the most populated
group. However, despite the slight decrease in satisfaction among
customers of these types, there is no strong evidence that customers’
demographic characteristics influence their satisfaction.
For A.2., the executives also want to know whether customer
satisfaction can be broken down into different attributes of the
airport; knowing this will help the team target specific strengths
and areas for improvement. The central feature of the customer
satisfaction survey is a 14-question portion (Question 6) asking
customers to rate their satisfaction with different aspects of the
airport. The executives asked us to perform a quantitative analysis
to determine whether broad themes emerge from this part of the
survey.
Initially, we took a quick look at the average score per airport
attribute. Each letter following Q6 corresponds to an attribute of
the airport; for example, A is artwork and exhibitions, B is
restaurants, and so on. The average scores per attribute are shown
below:
## ArtworkExhibitions Restaurants RetailShops
## 4.321274 3.976500 3.986085
## SignsDirections Escalators InformationScreens
## 3.963203 4.179035 4.004020
## InformationBoothsLower InformationBoothsUpper SignsRoadways
## 4.869821 4.815708 4.667594
## ParkingFacility AirTrain LongTermParking
## 5.090291 5.147495 5.459802
## Airportrentalcarcenter Overall
## 5.246444 3.946506
As a result, the three highest-rated attributes on average are long-term parking and its lot shuttle, the airport rental car center, and the AirTrain, all of which relate to parking and ground transportation.
We then performed Latent Class Analysis to better examine the potential groupings of these amenities and the correlations among them. We fit four LCA models with 3 to 6 latent classes based on the variables reflecting customer ratings of different aspects of the airport.
As before, we used the AIC values to determine the best-fitting model.
## [,1]
## class3_dfa2 105026.46
## class4_dfa2 102102.42
## class5_dfa2 99854.35
## class6_dfa2 99053.34
As a result, the lowest AIC corresponds to 6 latent classes. We then examine the model’s classes by viewing its conditional probabilities and plot to see which factors relate to each customer-type class.
## Conditional item response (column) probabilities,
## by outcome variable, for each class (row)
##
## $ArtworkExhibitions
## Pr(1) Pr(2) Pr(3) Pr(4) Pr(5) Pr(6)
## class 1: 0.0170 0.0950 0.4447 0.1894 0.0585 0.1954
## class 2: 0.0042 0.0996 0.5274 0.2670 0.0663 0.0355
## class 3: 0.0096 0.0109 0.0945 0.4582 0.3420 0.0849
## class 4: 0.0016 0.0018 0.0953 0.2284 0.3507 0.3222
## class 5: 0.0000 0.0233 0.1982 0.5056 0.1635 0.1093
## class 6: 0.0000 0.0268 0.1828 0.3071 0.0753 0.4080
##
## $Restaurants
## Pr(1) Pr(2) Pr(3) Pr(4) Pr(5) Pr(6)
## class 1: 0.0353 0.1802 0.5349 0.1386 0.0007 0.1102
## class 2: 0.0120 0.1818 0.6340 0.1385 0.0264 0.0072
## class 3: 0.0013 0.0143 0.1345 0.5196 0.2514 0.0789
## class 4: 0.0043 0.0275 0.1482 0.2654 0.2250 0.3294
## class 5: 0.0088 0.0307 0.3714 0.5101 0.0378 0.0412
## class 6: 0.0000 0.0464 0.2651 0.3030 0.0232 0.3623
##
## $RetailShops
## Pr(1) Pr(2) Pr(3) Pr(4) Pr(5) Pr(6)
## class 1: 0.0223 0.1638 0.5979 0.1156 0.0000 0.1003
## class 2: 0.0134 0.1623 0.6826 0.1257 0.0159 0.0000
## class 3: 0.0000 0.0202 0.1508 0.4789 0.2886 0.0615
## class 4: 0.0009 0.0276 0.1030 0.3025 0.2332 0.3329
## class 5: 0.0036 0.0420 0.3958 0.4601 0.0384 0.0602
## class 6: 0.0047 0.0285 0.2847 0.2947 0.0241 0.3633
##
## $SignsDirections
## Pr(1) Pr(2) Pr(3) Pr(4) Pr(5) Pr(6)
## class 1: 0.0286 0.1198 0.5808 0.2576 0.0102 0.0029
## class 2: 0.0130 0.0723 0.5394 0.3446 0.0307 0.0000
## class 3: 0.0125 0.0159 0.0546 0.3944 0.5087 0.0139
## class 4: 0.0045 0.0071 0.0287 0.1516 0.7829 0.0252
## class 5: 0.0000 0.0172 0.1359 0.6713 0.1739 0.0017
## class 6: 0.0171 0.0473 0.1648 0.6270 0.1032 0.0405
##
## $Escalators
## Pr(1) Pr(2) Pr(3) Pr(4) Pr(5) Pr(6)
## class 1: 0.0148 0.0762 0.6128 0.2459 0.0118 0.0385
## class 2: 0.0048 0.0588 0.5916 0.2974 0.0474 0.0000
## class 3: 0.0000 0.0032 0.0442 0.4006 0.5356 0.0164
## class 4: 0.0000 0.0000 0.0115 0.0779 0.8154 0.0952
## class 5: 0.0000 0.0113 0.1008 0.6633 0.2151 0.0095
## class 6: 0.0000 0.0155 0.1156 0.5687 0.1389 0.1614
##
## $InformationScreens
## Pr(1) Pr(2) Pr(3) Pr(4) Pr(5) Pr(6)
## class 1: 0.0237 0.1312 0.6105 0.2100 0.0128 0.0118
## class 2: 0.0090 0.0887 0.5985 0.2554 0.0484 0.0000
## class 3: 0.0064 0.0186 0.0366 0.4247 0.5082 0.0056
## class 4: 0.0035 0.0089 0.0288 0.1041 0.7959 0.0588
## class 5: 0.0040 0.0116 0.1168 0.7066 0.1585 0.0025
## class 6: 0.0064 0.0400 0.2215 0.5300 0.0891 0.1129
##
## $InformationBoothsLower
## Pr(1) Pr(2) Pr(3) Pr(4) Pr(5) Pr(6)
## class 1: 0.0164 0.1084 0.3898 0.0215 0.0000 0.4639
## class 2: 0.0000 0.1286 0.7733 0.0658 0.0323 0.0000
## class 3: 0.0124 0.0038 0.0431 0.2158 0.6486 0.0764
## class 4: 0.0000 0.0065 0.0104 0.0643 0.2217 0.6972
## class 5: 0.0000 0.0011 0.1126 0.4679 0.0423 0.3761
## class 6: 0.0054 0.0038 0.0585 0.1433 0.0289 0.7602
##
## $InformationBoothsUpper
## Pr(1) Pr(2) Pr(3) Pr(4) Pr(5) Pr(6)
## class 1: 0.0132 0.0994 0.4227 0.0249 0.0056 0.4342
## class 2: 0.0000 0.0990 0.7825 0.0697 0.0445 0.0042
## class 3: 0.0157 0.0045 0.0316 0.2308 0.6510 0.0664
## class 4: 0.0000 0.0058 0.0134 0.0637 0.2611 0.6559
## class 5: 0.0000 0.0000 0.1031 0.4971 0.0439 0.3559
## class 6: 0.0038 0.0071 0.0667 0.1888 0.0394 0.6942
##
## $SignsRoadways
## Pr(1) Pr(2) Pr(3) Pr(4) Pr(5) Pr(6)
## class 1: 0.0175 0.0921 0.3107 0.1842 0.0157 0.3798
## class 2: 0.0165 0.0841 0.4456 0.3438 0.1015 0.0086
## class 3: 0.0000 0.0040 0.0156 0.1028 0.8657 0.0120
## class 4: 0.0016 0.0162 0.0226 0.1459 0.3097 0.5040
## class 5: 0.0034 0.0345 0.1336 0.6436 0.1749 0.0099
## class 6: 0.0066 0.0090 0.0575 0.1672 0.0548 0.7047
##
## $ParkingFacility
## Pr(1) Pr(2) Pr(3) Pr(4) Pr(5) Pr(6)
## class 1: 0.0243 0.0562 0.1643 0.0415 0.0071 0.7066
## class 2: 0.0120 0.0400 0.5199 0.2354 0.1460 0.0467
## class 3: 0.0000 0.0000 0.0248 0.0846 0.8755 0.0150
## class 4: 0.0052 0.0071 0.0178 0.1006 0.1191 0.7501
## class 5: 0.0023 0.0394 0.1391 0.4872 0.0753 0.2567
## class 6: 0.0066 0.0041 0.0129 0.0186 0.0024 0.9554
##
## $AirTrain
## Pr(1) Pr(2) Pr(3) Pr(4) Pr(5) Pr(6)
## class 1: 0.0082 0.0450 0.1167 0.1675 0.0419 0.6208
## class 2: 0.0124 0.0290 0.3789 0.3012 0.2005 0.0780
## class 3: 0.0000 0.0044 0.0092 0.0626 0.9052 0.0186
## class 4: 0.0000 0.0037 0.0096 0.0430 0.2285 0.7152
## class 5: 0.0014 0.0086 0.0612 0.4268 0.2095 0.2925
## class 6: 0.0009 0.0055 0.0115 0.0986 0.0838 0.7997
##
## $LongTermParking
## Pr(1) Pr(2) Pr(3) Pr(4) Pr(5) Pr(6)
## class 1: 0.0109 0.0332 0.0374 0.0171 0.0126 0.8889
## class 2: 0.0155 0.0441 0.4390 0.1495 0.1595 0.1925
## class 3: 0.0000 0.0025 0.0057 0.0802 0.8396 0.0720
## class 4: 0.0031 0.0016 0.0019 0.0231 0.0508 0.9195
## class 5: 0.0008 0.0083 0.1001 0.2743 0.0424 0.5742
## class 6: 0.0000 0.0004 0.0058 0.0031 0.0077 0.9831
##
## $Airportrentalcarcenter
## Pr(1) Pr(2) Pr(3) Pr(4) Pr(5) Pr(6)
## class 1: 0.0224 0.0827 0.0968 0.0587 0.0064 0.7330
## class 2: 0.0211 0.0277 0.4362 0.2014 0.1673 0.1464
## class 3: 0.0000 0.0000 0.0131 0.0729 0.8587 0.0553
## class 4: 0.0100 0.0049 0.0138 0.0317 0.0796 0.8601
## class 5: 0.0038 0.0313 0.1101 0.3339 0.0800 0.4410
## class 6: 0.0014 0.0028 0.0048 0.0523 0.0281 0.9106
##
## $Overall
## Pr(1) Pr(2) Pr(3) Pr(4) Pr(5) Pr(6)
## class 1: 0.0056 0.0710 0.7115 0.2084 0.0000 0.0036
## class 2: 0.0044 0.0337 0.4881 0.4563 0.0090 0.0085
## class 3: 0.0000 0.0032 0.0543 0.5142 0.4200 0.0083
## class 4: 0.0000 0.0000 0.0147 0.3880 0.5899 0.0074
## class 5: 0.0014 0.0000 0.0461 0.8865 0.0660 0.0000
## class 6: 0.0000 0.0041 0.1872 0.7389 0.0479 0.0218
##
## Estimated class population shares
## 0.1642 0.0729 0.0967 0.1937 0.2194 0.2531
##
## Predicted class memberships (by modal posterior prob.)
## 0.1639 0.0705 0.0965 0.197 0.2217 0.2505
##
## =========================================================
## Fit for 6 latent classes:
## =========================================================
## number of observations: 3234
## number of estimated parameters: 425
## residual degrees of freedom: 2809
## maximum log-likelihood: -49101.67
##
## AIC(6): 99053.34
## BIC(6): 101638
## G^2(6): 47558.76 (Likelihood ratio/deviance statistic)
## X^2(6): 2.647978e+16 (Chi-square goodness of fit)
##
Although this was the optimal number of classes, we decided to reduce the model to 3 classes, since six classes may be too much information for an end user to digest and comprehend visually.
## Conditional item response (column) probabilities,
## by outcome variable, for each class (row)
##
## $ArtworkExhibitions
## Pr(1) Pr(2) Pr(3) Pr(4) Pr(5) Pr(6)
## class 1: 0.0007 0.0114 0.1240 0.2828 0.1914 0.3896
## class 2: 0.0039 0.0168 0.1346 0.4704 0.2824 0.0918
## class 3: 0.0086 0.0767 0.4084 0.2825 0.0721 0.1517
##
## $Restaurants
## Pr(1) Pr(2) Pr(3) Pr(4) Pr(5) Pr(6)
## class 1: 0.0032 0.0320 0.1928 0.2972 0.1000 0.3748
## class 2: 0.0040 0.0239 0.2190 0.5304 0.1726 0.0501
## class 3: 0.0204 0.1377 0.5412 0.2135 0.0072 0.0800
##
## $RetailShops
## Pr(1) Pr(2) Pr(3) Pr(4) Pr(5) Pr(6)
## class 1: 0.0031 0.0283 0.1715 0.3130 0.1071 0.3769
## class 2: 0.0014 0.0295 0.2325 0.4979 0.1838 0.0549
## class 3: 0.0145 0.1225 0.6032 0.1771 0.0049 0.0778
##
## $SignsDirections
## Pr(1) Pr(2) Pr(3) Pr(4) Pr(5) Pr(6)
## class 1: 0.0111 0.0267 0.0896 0.4384 0.3973 0.0369
## class 2: 0.0027 0.0107 0.0598 0.4919 0.4279 0.0071
## class 3: 0.0194 0.0843 0.4641 0.4097 0.0201 0.0024
##
## $Escalators
## Pr(1) Pr(2) Pr(3) Pr(4) Pr(5) Pr(6)
## class 1: 0.0000 0.0059 0.0475 0.3669 0.4366 0.1431
## class 2: 0.0000 0.0042 0.0476 0.4765 0.4623 0.0095
## class 3: 0.0077 0.0561 0.4761 0.4007 0.0324 0.0270
##
## $InformationScreens
## Pr(1) Pr(2) Pr(3) Pr(4) Pr(5) Pr(6)
## class 1: 0.0065 0.0227 0.1190 0.3611 0.3929 0.0979
## class 2: 0.0013 0.0118 0.0432 0.5166 0.4230 0.0040
## class 3: 0.0151 0.0898 0.4915 0.3753 0.0204 0.0079
##
## $InformationBoothsLower
## Pr(1) Pr(2) Pr(3) Pr(4) Pr(5) Pr(6)
## class 1: 0.0032 0.0040 0.0197 0.1079 0.0920 0.7732
## class 2: 0.0000 0.0047 0.0735 0.4032 0.3502 0.1684
## class 3: 0.0111 0.0758 0.3908 0.1128 0.0101 0.3994
##
## $InformationBoothsUpper
## Pr(1) Pr(2) Pr(3) Pr(4) Pr(5) Pr(6)
## class 1: 0.0038 0.0056 0.0240 0.1361 0.1165 0.7139
## class 2: 0.0013 0.0012 0.0579 0.4207 0.3579 0.1611
## class 3: 0.0078 0.0678 0.4120 0.1255 0.0130 0.3739
##
## $SignsRoadways
## Pr(1) Pr(2) Pr(3) Pr(4) Pr(5) Pr(6)
## class 1: 0.0025 0.0143 0.0359 0.1796 0.1529 0.6148
## class 2: 0.0026 0.0069 0.0507 0.3994 0.5341 0.0064
## class 3: 0.0144 0.0758 0.2969 0.3039 0.0457 0.2634
##
## $ParkingFacility
## Pr(1) Pr(2) Pr(3) Pr(4) Pr(5) Pr(6)
## class 1: 0.0053 0.0079 0.0155 0.0645 0.0323 0.8746
## class 2: 0.0010 0.0052 0.0705 0.3499 0.4673 0.1060
## class 3: 0.0158 0.0522 0.2263 0.1485 0.0320 0.5252
##
## $AirTrain
## Pr(1) Pr(2) Pr(3) Pr(4) Pr(5) Pr(6)
## class 1: 0.0000 0.0045 0.0091 0.0711 0.1424 0.7729
## class 2: 0.0008 0.0038 0.0265 0.2731 0.5315 0.1643
## class 3: 0.0072 0.0312 0.1554 0.2458 0.0995 0.4610
##
## $LongTermParking
## Pr(1) Pr(2) Pr(3) Pr(4) Pr(5) Pr(6)
## class 1: 0.0015 0.0000 0.0031 0.0080 0.0152 0.9723
## class 2: 0.0000 0.0061 0.0502 0.2424 0.4017 0.2995
## class 3: 0.0086 0.0269 0.1368 0.0708 0.0372 0.7197
##
## $Airportrentalcarcenter
## Pr(1) Pr(2) Pr(3) Pr(4) Pr(5) Pr(6)
## class 1: 0.0063 0.0040 0.0095 0.0467 0.0461 0.8874
## class 2: 0.0000 0.0066 0.0509 0.2523 0.4229 0.2673
## class 3: 0.0162 0.0583 0.1696 0.1243 0.0453 0.5863
##
## $Overall
## Pr(1) Pr(2) Pr(3) Pr(4) Pr(5) Pr(6)
## class 1: 0.0000 0.0021 0.0913 0.6140 0.2760 0.0166
## class 2: 0.0012 0.0013 0.0316 0.6637 0.2983 0.0038
## class 3: 0.0035 0.0398 0.4836 0.4693 0.0000 0.0039
##
## Estimated class population shares
## 0.4002 0.2402 0.3595
##
## Predicted class memberships (by modal posterior prob.)
## 0.3992 0.2418 0.359
##
## =========================================================
## Fit for 3 latent classes:
## =========================================================
## number of observations: 3234
## number of estimated parameters: 212
## residual degrees of freedom: 3022
## maximum log-likelihood: -52301.23
##
## AIC(3): 105026.5
## BIC(3): 106315.7
## G^2(3): 53957.88 (Likelihood ratio/deviance statistic)
## X^2(3): 5.77504e+17 (Chi-square goodness of fit)
##
Using 3 latent classes, we can see each class’s share of the population above, along with how its members would, on average, score the different airport attributes. Class 3 represents the largest portion of the population and corresponds to the highest average overall score; it typically ranks the AirTrain, the parking facilities, the information screens in both the upper and lower sections of the airport, and the rental car center as the attributes tied to the highest overall airport score. Class 1, which typically ranks the airport’s overall score the lowest, rates the artwork, restaurants, and retail shops poorly and places less emphasis on the screens and monitors. Class 2 rates the airport fairly evenly across the board and does not highlight many features as better than others.
In terms of broad themes across this analysis, the parking facilities, AirTrain, long-term parking, and other practical features appear to carry the heaviest weight in the overall score of the SFO airport. As for areas to improve, the restaurants, retail shops, and other smaller amenities could be upgraded, as their scores fluctuated, especially within the class representing the most disgruntled visitors.
We used sentiment analysis and topic models to determine common concepts in the free-response questions. Sentiment analysis is a natural language processing technique that quantifies the feeling behind text, while topic modeling is a machine learning technique that identifies similar content across documents; the topic models were therefore used to gain a deeper understanding of the comments.
Initially, we performed four sentiment analyses on the customers’
free-response variable Q7_text_All, grouped by: (i) airline terminal,
(ii) scheduled departure time, (iii) airline type, and (iv) whether
the customer is connecting from another flight. The sentiment of each
comment was calculated using the bing lexicon, and insights were
gathered from the groups in the data.
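The per-group scoring can be sketched as follows (illustrative Python with a tiny hand-made word list standing in for the bing lexicon, which assigns each word a positive or negative polarity; the comments below are made up):

```python
# Illustrative bing-style sentiment: each comment's score is
# (# positive words) - (# negative words); scores are then summed
# within each group (e.g., terminal).
POSITIVE = {"great", "clean", "easy", "good"}
NEGATIVE = {"confusing", "dirty", "slow", "rude"}

def comment_sentiment(text):
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

comments = [
    (1, "signs are confusing and staff rude"),   # terminal 1
    (1, "clean terminal"),                       # terminal 1
    (2, "great and easy to navigate"),           # terminal 2
]
by_group = {}
for group, text in comments:
    by_group[group] = by_group.get(group, 0) + comment_sentiment(text)
```

A more negative group total indicates that complaints outnumber compliments for that group, which is how the tables below should be read.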
Airline Terminal: Note that responses indicating 1 and 3
refer to Terminals 1 and 3, while 2 refers to the International
Terminal.
## # A tibble: 3 × 2
## TERM sentiment
## <int> <dbl>
## 1 1 -200
## 2 2 -88
## 3 3 -197
Scheduled Flight Time to Leave: Note that a response of 1 refers to AM
times before 11 AM, 2 refers to MID times from 11 AM to 5 PM, and 3
refers to PM times after 5 PM.
## # A tibble: 3 × 2
## STRATA sentiment
## <int> <dbl>
## 1 1 -186
## 2 2 -149
## 3 3 -150
Airline Type: Note that a response of 1 refers to major carriers, 2
refers to small/international carriers, and 3 refers to new
carriers.
## # A tibble: 3 × 2
## ATYPE sentiment
## <int> <dbl>
## 1 1 -392
## 2 2 -118
## 3 3 25
Whether they are connecting from another flight: Note that a response
of 1 refers to yes and 2 refers to no.
## # A tibble: 2 × 2
## Q1 sentiment
## <int> <dbl>
## 1 1 -57
## 2 2 -426
As a result, when looking at the terminals, individuals flying out of Terminal 1 and Terminal 3 tend to leave more negative comments than those flying out of the International Terminal. Respondents with flights scheduled to depart before 11 AM leave slightly more negative comments than those departing later in the day. Respondents flying on a major carrier tend to leave very negative comments, those on small or international carriers leave slightly negative comments, and only those on a new carrier tend to leave positive comments. Respondents who are not connecting from another flight leave significantly more negative comments than those who are. Lastly, respondents who drove a rental car tend to leave very negative comments, while those using all other modes of transportation tend to leave neutral comments. These grouped sentiments indicate where improvements by the airport would have the largest impact.
Next, we fit topic models to determine the optimal number of themes among the customer free responses overall. We begin by tuning the number of topics, k, to see what is trending among the customer responses.
## Building corpus...
## Converting to Lower Case...
## Removing punctuation...
## Removing stopwords...
## Removing numbers...
## Stemming...
## Creating Output...
## Removing 6 of 227 terms (6 of 12498 tokens) due to frequency
## Your corpus now has 1557 documents, 221 terms and 12492 tokens.
Examining k values from 3 to 20, we used the semantic coherence and residual plots to determine the optimal k for our topic model. As a result, k = 15 performed best overall, displaying high semantic coherence and the lowest residuals in comparison to the other values.
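One way to operationalize this trade-off is sketched below (illustrative Python; the coherence and residual values are made up, not our actual diagnostics): among candidate k values whose semantic coherence is near the maximum, pick the one with the smallest residuals.

```python
# Illustrative k selection: favor high semantic coherence, then break
# ties toward low residuals. Diagnostic values below are made up.
coherence = {5: -95.0, 10: -82.0, 15: -80.5, 20: -81.0}
residuals = {5: 3.10, 10: 2.40, 15: 2.05, 20: 2.20}

def pick_k(coherence, residuals, tol=2.0):
    best_coh = max(coherence.values())
    # Candidates: k values with coherence within `tol` of the best.
    candidates = [k for k, c in coherence.items() if best_coh - c <= tol]
    # Among those, choose the k with the smallest residuals.
    return min(candidates, key=residuals.get)

best_k = pick_k(coherence, residuals)   # 15 on this toy data
```

In practice we made this call by inspecting the two plots rather than a hard rule, but the logic is the same: discard k values with clearly worse coherence, then prefer the lowest residuals.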
## Topic 1 Top Words:
## Highest Prob: comment, enoughne, move, upgrad, walkwaysescalatorselev, workingnot, posit
## FREX: enoughne, move, upgrad, walkwaysescalatorselev, workingnot, comment, understandhear
## Lift: enoughne, move, upgrad, walkwaysescalatorselev, workingnot, understandhear, comment
## Score: comment, understandhear, enoughne, upgrad, walkwaysescalatorselev, workingnot, move
## Topic 2 Top Words:
## Highest Prob: find, get, termin, airportdifficult, confusinghard, correct, outsid
## FREX: airportdifficult, confusinghard, correct, outsid, find, termin, get
## Lift: airportdifficult, awaynot, confusinghard, correct, enoughhard, hard, outsid
## Score: find, airportdifficult, confusinghard, correct, outsid, get, signag
## Topic 3 Top Words:
## Highest Prob: get, far, awaydifficult, car, center, rental, rude
## FREX: awaydifficult, car, center, rental, rude, toconfusingemploye, airtrainbart
## Lift: carts-, enoughinconveni, locatedshould, luggag, airtrainbart, improv, tofromon
## Score: awaydifficult, car, center, rental, rude, toconfusingemploye, get
## Topic 4 Top Words:
## Highest Prob: inform, chang, enoughlack, informationdisplay, rapid, screens-, smallnot
## FREX: inform, chang, enoughlack, informationdisplay, rapid, screens-, smallnot
## Lift: traffic, acheat, air, booth, chang, control, delaysbad
## Score: inform, informationdisplay, rapid, screens-, smallnot, chang, enoughlack
## Topic 5 Top Words:
## Highest Prob: better, food, expens, one, servicesamen, termin, shopsservic
## FREX: food, better, one, servicesamen, expens, shopsservic, termin
## Lift: food, one, servicesamen, shopsservic, better, expens, termin
## Score: better, shopsservic, food, expens, servicesamen, one, termin
## Topic 6 Top Words:
## Highest Prob: need, uniqu, restaur, shop, secur, hous, starbuckspeetscoffe
## FREX: uniqu, restaur, shop, hous, starbuckspeetscoffe, type, bilingu
## Lift: cleannot, effici, entertain, hous, morenot, open, restaur
## Score: uniqu, need, restaur, bilingu, shop, hous, starbuckspeetscoffe
## Topic 7 Top Words:
## Highest Prob: area, get, termin, long, airportno, around, awayconfusingtoo
## FREX: area, airportno, around, awayconfusingtoo, difficulttak, intern, layout-termin
## Lift: busi, everywher, foodwatersoda, garden, hot, locker, loung
## Score: area, get, needed-children’, pet, play, restroom, sale
## Topic 8 Top Words:
## Highest Prob: securitycustom, longinefficientineffect, personnel, inefficientrudenot, train, well, negat
## FREX: securitycustom, longinefficientineffect, inefficientrudenot, train, well, personnel, negat
## Lift: negat, longinefficientineffect, securitycustom, inefficientrudenot, train, well, personnel
## Score: longinefficientineffect, securitycustom, personnel, train, inefficientrudenot, well, negat
## Topic 9 Top Words:
## Highest Prob: posit, general, sfo, cleanli, wifi, pos, transportation-po
## FREX: sfo, general, pos, posit, artworkexhibit, wifi, cleanli
## Lift: artworkexhibit, sfo, pos, general, transportation-po, wifi, cleanli
## Score: sfo, general, posit, artworkexhibit, cleanli, pos, wifi
## Topic 10 Top Words:
## Highest Prob: enough, low, qualityne, healthier, select, amen, clean
## FREX: enough, amen, clean, hook, miss, restrooms-, small
## Lift: amen, amenities-clock, atm’, check-, clean, curbsid, hand
## Score: enough, low, qualityne, select, healthier, small, hook
## Topic 11 Top Words:
## Highest Prob: need, facil, look, oldoutd, upgradingairport, electr, outlet
## FREX: look, oldoutd, upgradingairport, electr, outlet, signsannouncementspersonnel, transport
## Lift: transport, look, oldoutd, restaurantsstoresclub, upgradingairport, camera, electr
## Score: signsannouncementspersonnel, oldoutd, look, upgradingairport, need, facil, outlet
## Topic 12 Top Words:
## Highest Prob: linesprocedur, neg, artwork, artworkexhibitionschang, frequent, fast, foodchain
## FREX: linesprocedur, artwork, artworkexhibitionschang, frequent, neg, allow, appreci
## Lift: allow, appreci, artwork, artworkexhibitionschang, cut, frequent, front
## Score: linesprocedur, neg, frequent, artwork, artworkexhibitionschang, foodchain, fast
## Topic 13 Top Words:
## Highest Prob: free, avail, accessdoesn’t, airportoverloadeddidn’t, cover, enoughdifficult, entir
## FREX: accessdoesn’t, airportoverloadeddidn’t, cover, enoughdifficult, entir, know, wifi-
## Lift: electronicautom, accessdoesn’t, airportoverloadeddidn’t, cover, enoughdifficult, entir, know
## Score: accessdoesn’t, airportoverloadeddidn’t, cover, enoughdifficult, entir, know, wifi-
## Topic 14 Top Words:
## Highest Prob: airlin, airport, signag, confusingsmallhard, gate, insid, find
## FREX: airlin, confusingsmallhard, gate, insid, signag, airport, don’t
## Lift: confusingsmallhard, don’t, gate, insid, airlin, signag, airport
## Score: confusingsmallhard, gate, insid, find, signag, airlin, airport
## Topic 15 Top Words:
## Highest Prob: seat, airport, need, crowdedmor, areas-, locat, neededinconveni
## FREX: seat, crowdedmor, airport, areas-, locat, neededinconveni, smoke
## Lift: areas-, crowdedmor, locat, neededinconveni, seat, smoke, airport
## Score: seat, crowdedmor, neededinconveni, need, airport, smoke, locat
## $dispersion
## [1] 0.8307205
##
## $pvalue
## [1] 1
##
## $df
## [1] 63273
Based on the dispersion test statistic (0.83), the word counts do not vary much relative to their means, suggesting the distribution under this model has low dispersion, which is good for precision. However, for computational simplicity we decided to base all of our interpretations moving forward on an examination of 5 topics.
## Topic 1 Top Words:
## Highest Prob: get, far, termin, airport, airlin, awaydifficult, car
## FREX: far, awaydifficult, car, center, rental, rude, toconfusingemploye
## Lift: add, air, allow, appreci, artwork, artworkexhibitionschang, check
## Score: far, get, car, center, rental, rude, toconfusingemploye
## Topic 2 Top Words:
## Highest Prob: better, uniqu, find, airport, signag, termin, shop
## FREX: better, shop, uniqu, airportdifficult, confusinghard, correct, outsid
## Lift: airportdifficult, confusinghard, correct, one, outsid, servicesamen, shop
## Score: better, uniqu, airportdifficult, confusinghard, correct, outsid, shop
## Topic 3 Top Words:
## Highest Prob: expens, inform, food, chang, enoughlack, informationdisplay, rapid
## FREX: inform, food, chang, enoughlack, informationdisplay, rapid, screens-
## Lift: chang, enoughlack, healthier, hous, informationdisplay, low, qualityne
## Score: chang, enoughlack, smallnot, informationdisplay, rapid, screens-, food
## Topic 4 Top Words:
## Highest Prob: need, restaur, enough, seat, free, crowdedmor, avail
## FREX: enough, seat, free, crowdedmor, accessdoesn’t, airportoverloadeddidn’t, cover
## Lift: areas-, carts-, cleannot, effici, enoughinconveni, locat, locatedshould
## Score: need, enough, restaur, accessdoesn’t, airportoverloadeddidn’t, cover, enoughdifficult
## Topic 5 Top Words:
## Highest Prob: comment, posit, airlin, general, sfo, confusingsmallhard, gate
## FREX: comment, posit, general, sfo, confusingsmallhard, gate, insid
## Lift: amenities-clock, atm’, check-, curbsid, hand, negat, payphon
## Score: comment, area, shopsservic, sfo, posit, general, airlin
## $dispersion
## [1] 0.9508983
##
## $pvalue
## [1] 1
##
## $df
## [1] 130302
The dispersion test statistic for the 5-topic model (0.95) is higher than for the 15-topic model, indicating somewhat more variability relative to the mean; we keep this in mind when examining our modeling analyses. The five topics fitted around customers' free responses were as follows: (i) topic one pertains to crowding or overloading within the airport; (ii) topic two references the general areas and amenities; (iii) topic three concerns the security screening process and its issues, such as long lines and inefficiency; (iv) topic four covers the restaurants, food, and shops and their costs; and (v) topic five is about finding the gates and getting lost. These topics indicate what people talk about the most, so airport improvements can be prioritized to address these comments.
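As a quick aside on how we read these dispersion values, here is a toy sketch of our own (not the `stm` internals): loosely speaking, the statistic behaves like a variance-to-mean ratio of residual counts, so values below 1 indicate underdispersion and values above 1 overdispersion.

```r
# Toy illustration (our own sketch): dispersion read as a variance-to-mean ratio.
set.seed(1)
counts_binom <- rbinom(1000, size = 10, prob = 0.5)  # variance < mean
counts_pois  <- rpois(1000, lambda = 5)              # variance ~= mean
dispersion <- function(x) var(x) / mean(x)
dispersion(counts_binom)  # well below 1: underdispersed
dispersion(counts_pois)   # close to 1
```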
We concluded that the responses examined in Section A.3. support different claims from Sections A.1. and A.2. With respect to Section A.1., the sentiment analysis conducted in Part A.3. pointed to an opposing conclusion about customer satisfaction: while Section A.1. showed most customer ratings to be 4 or 5 (i.e., above-average or outstanding), Section A.3. suggests that the majority of customers' free responses carry a negative tone in their feedback on SFO. Although this was surprising, it is consistent with our conclusion that customers' demographic characteristics showed no statistical evidence of influencing their satisfaction level. With respect to Section A.2., the topic model analysis conducted in Part A.3. did reach similar conclusions, suggesting that customers treat the free-response section as a suggestion box for areas of improvement at SFO, regardless of having rated their overall experience above average.
In Part 1 of the project, our team conducted EDA and EFA evaluations on the dataset, which fostered the set of hypotheses we set out to investigate. Recall that the hypotheses we proposed were the following:
We presume overall that: (i) males fly more frequently than females; (ii) ages 25 to 54 have the highest flying frequency, with a bimodal distribution peaking at 25-34 and 45-54; and (iii) lower-income flyers fly locally, higher-income flyers fly internationally, and middle-class flyers fly domestically on average.
We presume that most flights out of SFO are international, but that this varies based on other factors like the passenger's socioeconomic status.
We presume that the higher the rating for SFO attributes, the higher the frequency of flights flown out of SFO.
We presume that the lower the rating for SFO cleanliness, the lower the ratings for SFO attributes.
We presume that the lower the safety ratings are at SFO, the lower the frequency of flights flown out of SFO.
Further investigations are discussed in detail in Section B.2.
The SFO executives feel that additional insights can be gained from the customer satisfaction survey dataset. Based on your prior EDA deliverable and the topics we have discussed in class, develop an additional research question and execute a plan to evaluate it with these data using a method we covered this semester. Provide an appropriate explanation of your method of choice and how it applies to your question. If formal hypotheses are tested, clearly explain the results of these tests. If the method is more descriptive or data-driven, define how the results are evaluated, and provide sufficient output and data visuals to communicate the outcome. You don’t need to fish for a “significant” finding here; even null or unexpected results can be useful if the hypothesis is reasonable.
Do passengers who fly internationally out of SFO differ on factors, such as socioeconomic status, from passengers who fly in state or out of state?
We presume that most flights out of SFO are international, but that this varies based on other factors such as the passenger's socioeconomic status.
The plan for this analysis is to begin with data pre-processing to deal with missing values and to structure our data. We will then do some EDA to look for correlations between variables, examine the number of flights to each destination, and build word clouds from reviews/comments, while also cleaning the data needed to answer our question. Following that, we will use random forests and k-means clustering to answer our research question.
We will then apply k-means clustering to our cleaned data, in which each column's NAs are filled with the column mean. We chose k-means clustering because we wanted to see whether the survey reveals specific characteristics for people who flew out of state, in state, or internationally, in order to see if they truly are different types of passengers depending on where they are flying.
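A minimal base-R sketch of this imputation step on a made-up data frame (equivalent in spirit to the `zoo::na.aggregate` call in our code; columns `a` and `b` are hypothetical):

```r
# Fill each column's NAs with that column's mean, then scale and cluster.
df <- data.frame(a = c(1, NA, 3, 4), b = c(10, 20, NA, 40))
df_imp <- as.data.frame(lapply(df, function(col) {
  col[is.na(col)] <- mean(col, na.rm = TRUE)  # replace NAs with the column mean
  col
}))
set.seed(1)
km <- kmeans(scale(df_imp), centers = 2, nstart = 10)
km$cluster  # cluster label for each row
```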
All EDA and data pre-processing were performed in Part 1 of the project. Overall, we noticed many survey variables had large amounts of missing values, and some had no records at all. Each variable is a customer survey response pertaining to the customer's flight information. We decided to remove variables with more than 20% missing data. We examined the variables' descriptive statistics along with their correlations with one another. Note that instead of removing NA values, we replaced them with the means of the available data in each column. Questions 5-8 showed the strongest correlations. In the majority of cases we see positive correlations between questions, the strongest being among the sub-questions of Q6 (A through N) and Q8 (A through F). In addition, two points stood out: Q1 versus Q12 and Q1 versus Q15 had the strongest negative correlations, at -0.41 and -0.65 respectively. More specifically, we also examined customer feedback in the Q7_text_All variable by creating a word cloud, in which the words customers used most frequently in their comments appear larger in size, providing an initial idea of what customers may find important or feel. The most frequent words were better, find, positive, expensive, and food.
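The frequencies behind the word cloud can also be tallied directly with tidytext; a sketch, assuming the survey data frame `data` and its `Q7_text_All` column have been loaded as in the appendix code:

```r
library(dplyr)
library(tidytext)
# Tokenize the free-text comments, drop common stop words, and count word frequency.
top_words <- data %>%
  unnest_tokens(word, Q7_text_All) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)
head(top_words, 10)  # most frequent comment words
```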
| ATYPE | WithinCalifornia | OutOfState | OutOfCountry |
|---|---|---|---|
| 1 | 373 | 1557 | 195 |
| 2 | 69 | 201 | 521 |
| 3 | 0 | 0 | 318 |
By looking at airline type and destination, we noticed that the majority of flights were on a major airline carrier, traveling out of California but within the United States, whereas new carriers (ATYPE 3) only flew out of the country. We can use clustering to determine whether these passengers answered the survey similarly.
We first examine a variable importance plot from a random forest to determine which variables are most important in predicting the destination.
It is clear that DESTGEO is correlated with the destination, since it is an assigned code for the area of the world the flight is destined for; although it is more specific than the destination variable we are currently using and could be interesting on its own, it should be removed because it directly encodes the outcome. Gate number, weight, interview date (INTDATE), terminal, airline type, and airline are also correlated with destination but describe airport logistics rather than the passenger, so they should be removed as well. We therefore remove those variables and produce another variable importance plot.
Based on the variable importance plot, the variables that matter most for whether a passenger is traveling in state, out of state, or internationally are the country the passenger is from, the market size of the destination airport, the time of day the flight is scheduled to leave, and whether the passenger resides in the Bay Area; all of these make sense as important variables. The six most important survey questions were the country you departed from, whether you checked baggage, your age, whether you live in the Bay Area, how you rate the cleanliness of the boarding areas, and how you got to the airport. These all make sense, but the most interesting are age, checked baggage, and the cleanliness rating: they could be telling signs that certain characteristics and backgrounds matter in determining whether a passenger is flying in state, out of state, or internationally. We now perform a k-means clustering analysis to test these thoughts and see whether there are clear groups of passengers.
We first check whether the gap statistic method recommends a particular number of clusters for the top 23 most important variables from the plot above; 23 was chosen because there is a drop in mean decrease in Gini after the 23rd variable (Q6F).
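Rather than hard-coding the 23 variable names, the same subset can be pulled programmatically from the fitted forest's importance matrix; a sketch, assuming `fit_rand2` and `avg_imp` from our appendix code:

```r
library(randomForest)
# Rank predictors by mean decrease in Gini and keep the top 23,
# then reattach DEST (the response) for the clustering step.
imp <- importance(fit_rand2)
top23 <- rownames(imp)[order(imp[, "MeanDecreaseGini"], decreasing = TRUE)][1:23]
clust_vars <- avg_imp[, c(top23, "DEST")]
```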
The gap statistic method suggests using 10 clusters. Let's look at the elbow method as well.
The elbow appears at 2, so this method suggests using 2 clusters. Let's take a look at the last method we will use, the silhouette method.
The silhouette method also suggests using 2 clusters. We will nonetheless examine three clusters, to see whether they in fact group by the out-of-state, in-state, and international destinations.
Unfortunately, the graphic above shows that some points could plausibly belong to two or even all three of the clusters, which is not ideal. This suggests there is no clear grouping of passengers, which goes against our hypothesis. However, let's still take a look at the means of these clusters.
## COUNTRY DESTMARK STRATA SFFLAG Q12 Q4A
## 1 -0.1786606 -0.07719590 0.008825827 -0.2663958 0.1203106 -0.067520879
## 2 0.4708189 0.17207500 -0.119626514 0.9835120 -0.4556205 0.106247350
## 3 -0.1251081 -0.03125951 0.076242292 -0.3911932 0.1849916 0.004837363
## Q17 Q15 Q8A Q3_1 Q6E Q2_1
## 1 0.05677257 0.3028992 0.50078141 -0.01077961 0.5731969 0.17817744
## 2 0.06239887 -0.9511022 -0.08321762 -0.42471082 -0.1656354 -0.20731000
## 3 -0.11438113 0.3232551 -0.54827397 0.32187944 -0.5763862 -0.06588204
## Q19 Q13B Q6A Q8F Q6C Q13A
## 1 -0.28750301 0.36846608 0.43874134 0.5599800 0.6148981 0.2926755
## 2 0.44058717 -0.06849857 -0.06068001 -0.2276558 -0.3573986 0.1613276
## 3 0.02918551 -0.39812587 -0.48924031 -0.5152287 -0.4876636 -0.4730803
## Q6B Q5AVG Q5 Q8E Q6F DEST
## 1 0.6469607 -0.3278072 -0.4235364 0.6480742 0.51334834 0.105230296
## 2 -0.3624481 0.8657984 1.0042848 -0.3401270 -0.09917135 -0.175612996
## 3 -0.5229695 -0.2309577 -0.2152675 -0.5405511 -0.55195221 -0.000248309
Looking at the DEST variable in the standardized centers above, cluster 2 has the lowest mean (more in-state travelers, DEST = 1), cluster 3 sits in between (more out-of-state travelers, DEST = 2), and cluster 1 has the highest mean (more international travelers, DEST = 3). Judging by this we can make some, admittedly tentative, observations about each type of passenger. For example, Q19 is household income, and its highest mean falls in cluster 2, the same cluster as the in-state travelers. This can make sense: people with higher incomes may be more likely to fly in state since it is quicker, rather than spending less money by taking a car. Another interesting variable is age (Q17): on average the youngest passengers fall in cluster 3, the more out-of-state cluster, while the oldest fall in cluster 2, the more in-state cluster.
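Because the centers above are on the standardized scale, mapping them back to the original variable units (e.g., the Q19 income codes or Q17 age codes) can make such comparisons easier to read. A sketch, assuming `clust_vars` from our appendix code; note that k-means cluster labels are arbitrary, so a refit may number the clusters differently than the table above:

```r
set.seed(1)
scaled <- scale(clust_vars)
km <- kmeans(scaled, centers = 3)
# Undo the z-scoring: multiply by the column SDs, then add back the column means.
centers_orig <- sweep(km$centers, 2, attr(scaled, "scaled:scale"), `*`)
centers_orig <- sweep(centers_orig, 2, attr(scaled, "scaled:center"), `+`)
round(centers_orig, 2)  # cluster centers in original units
```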
In all, though, there did not seem to be any clear groupings of these passengers, so we fail to find support for our hypothesis that passengers share characteristics based on whether they fly in state, out of state, or internationally. The takeaway is that although there were no clear patterns tying passengers to their destination, there may still be other ways to group passengers in order to target groups with specific needs or accommodations.
# load libraries
library(tidyverse)
library(dplyr)
library(ggplot2)
library(zoo)
library(tidyr)
library(knitr)
library(poLCA)
library(tidytext)
library(sentimentr)
library(stm)
library(tm)
library(psych)
library(wordcloud)
library(corrplot)
library(factoextra)
library(randomForest)
# Load data
data = read.table("SFO_survey_withText.txt", header = TRUE, sep = "", dec = ".")
# Computes and filters the percentage of missing data from each variable
colSums(is.na(data[colSums(is.na(data)/dim(data)[1]) > 0.3])/dim(data)[1]*100)
colSums(is.na(data[colSums(is.na(data)/dim(data)[1]) <= 0.3])/dim(data)[1]*100)
# Final data.frame for Part A section
data_tidy <- data[colSums(is.na(data)/dim(data)[1]) <= 0.3] #%>% na.omit()
# Data.frame for A.1.
dfa1 <- round(data_tidy %>% dplyr::select(c(Q6N, Q17, Q18, Q19)) %>% na.aggregate(),0)
# Data.frame for A.2.
dfa2 <- round(data_tidy %>% dplyr::select(c(Q6A:Q6N)) %>% na.aggregate(),0)
# Data.frame for A.3.
dfa3 <- data
## Data Pre-Processing
# Response filterization
dfa1 <- dfa1 %>% filter(Q6N != 0, Q17 != 0, Q18 != 0, Q19 != 0,
                        Q6N != 6, Q17 != 8, Q19 != 5)
# Create factor variables for each variable:
# customer ratings
dfa1$Q6N_factor <- factor(as.factor(dfa1$Q6N), levels = c(1,2,3,4,5),
labels = c("Unacceptable","Below-Average","Average",
"Above-Average","Outstanding"))
# age group
dfa1$Q17_factor <- factor(as.factor(dfa1$Q17), levels = c(1,2,3,4,5,6,7),
labels = c("< 18","18-24","25-34","35-44",
"45-54","55-64","65 =<"))
# gender
dfa1$Q18_factor <- factor(as.factor(dfa1$Q18), levels = c(1,2),
labels = c("Male","Female"))
# income level
dfa1$Q19_factor <- factor(as.factor(dfa1$Q19), levels = c(1,2,3,4),
labels = c("< 50k", "50k-100k", "100k-150k","150k <"))
# Create customer rating type variable
dfa1$binary_rating <- dfa1$Q6N_factor
levels(dfa1$binary_rating) <- c("Dissatisfied","Dissatisfied",
"Dissatisfied","Satisfied","Satisfied")
##EDA
## univariate analysis
# Distribution of customer's rating
ggplot(dfa1, aes(x=Q6N_factor)) + geom_bar() +
geom_text(stat='count', aes(label=..count..), vjust = -1) +
xlab( "Ratings for SFO as a whole") +
ggtitle("Distribution of Customer Ratings on SFO on a whole") +
ylim(0,2000)
# Distribution of age
ggplot(dfa1, aes(x=Q17_factor)) + geom_bar() +
geom_text(stat='count', aes(label=..count..), vjust = -1) +
xlab("Range of Age Group (in years)") +
ggtitle("Distribution of Customer's Age Group") +
ylim(0,800)
# Distribution of gender
ggplot(dfa1, aes(x=Q18_factor)) + geom_bar() +
geom_text(stat='count', aes(label=..count..), vjust = -1) +
xlab("Gender") + ggtitle("Distribution of Customer's Gender") +
ylim(0,2000)
# Distribution of household income level
ggplot(dfa1, aes(x=Q19_factor)) + geom_bar() +
geom_text(stat='count', aes(label=..count..), vjust = -1) +
xlab("Range in Household Income (in US Dollars)") +
ggtitle("Distribution of Customer's Household Income") +
ylim(0,1750)
# Distribution of customer's binary rating
ggplot(dfa1, aes(x=binary_rating)) + geom_bar() +
geom_text(stat='count', aes(label=..count..), vjust = -1) +
xlab("Customer Type") + ggtitle("Distribution of Customer's Rating Classification") +
ylim(0,2750)
## bivariate analysis
# Distribution of age
ggplot(dfa1, aes(x=Q17_factor, fill = binary_rating)) +
geom_bar() + geom_text(stat='count', aes(label=..count..), vjust = -1) +
xlab("Range of Age Group (in years)") +
ggtitle("Distribution of Customer's Age Group") +
ylim(0,800)
# Distribution of gender
ggplot(dfa1, aes(x=Q18_factor, fill = binary_rating)) +
geom_bar() + geom_text(stat='count', aes(label=..count..), vjust = -1) +
xlab("Gender") + ggtitle("Distribution of Customer's Gender") +
ylim(0,2000)
# Distribution of household income level
ggplot(dfa1, aes(x=Q19_factor, fill = binary_rating)) + geom_bar() +
geom_text(stat='count', aes(label=..count..), vjust = -1) +
xlab("Range in Household Income (in US Dollars)") +
ggtitle("Distribution of Customer's Household Income") +
ylim(0,1750)
# Rename colnames
# Note that Q6N = customer_rating, Q18 = gender, Q17 = age group, Q19 = income level
colnames(dfa1) <- c('Rating_num','AgeGroup_num','Gender_num','IncomeLevel_num',
'Rating_fa','AgeGroup_fa','Gender_fa','IncomeLevel_fa', 'BinaryRating')
## Multivariate analysis
dfa1 %>% group_by(BinaryRating, Gender_fa, AgeGroup_fa, IncomeLevel_fa) %>% summarise(n=n())
## Model Analysis - LCA
# Creates LCA formula
lca_formula_dfa1 = cbind(IncomeLevel_num, AgeGroup_num, Gender_num, Rating_num) ~ 1
lca_formula_binary_dfa1 = cbind(IncomeLevel_num, AgeGroup_num, Gender_num, BinaryRating) ~ 1
# Performs LCA analysis
#(for each customer rating)
lca_classes3_dfa1 = poLCA(lca_formula_dfa1, dfa1, nclass = 3, maxiter = 5000)
lca_classes4_dfa1 = poLCA(lca_formula_dfa1, dfa1, nclass = 4, maxiter = 5000)
lca_classes5_dfa1 = poLCA(lca_formula_dfa1, dfa1, nclass = 5, maxiter = 5000)
lca_classes6_dfa1 = poLCA(lca_formula_dfa1, dfa1, nclass = 6, maxiter = 5000)
#(for binary classifier)
lca_classes3_binary_dfa1 = poLCA(lca_formula_binary_dfa1, dfa1, nclass = 3, maxiter = 5000)
lca_classes4_binary_dfa1 = poLCA(lca_formula_binary_dfa1, dfa1, nclass = 4, maxiter = 5000)
lca_classes5_binary_dfa1 = poLCA(lca_formula_binary_dfa1, dfa1, nclass = 5, maxiter = 5000)
lca_classes6_binary_dfa1 = poLCA(lca_formula_binary_dfa1, dfa1, nclass = 6, maxiter = 5000)
# Reports each lca model AIC values
#(for each customer rating)
rbind(class3_dfa1 = lca_classes3_dfa1$aic,
class4_dfa1 = lca_classes4_dfa1$aic,
class5_dfa1 = lca_classes5_dfa1$aic,
class6_dfa1 = lca_classes6_dfa1$aic
)
#(for binary classifier)
rbind(class3_binary_dfa1 = lca_classes3_binary_dfa1$aic,
class4_binary_dfa1 = lca_classes4_binary_dfa1$aic,
class5_binary_dfa1 = lca_classes5_binary_dfa1$aic,
class6_binary_dfa1 = lca_classes6_binary_dfa1$aic
)
# Reports selected lca models:
# using each customer rating
lca_classes5_dfa1
plot(lca_classes5_dfa1)
# using binary classifier rating
lca_classes4_binary_dfa1
plot(lca_classes4_binary_dfa1)
## Data Pre-Processing & EDA
# Rename Variables
colnames(dfa2) <- c('ArtworkExhibitions', 'Restaurants', 'RetailShops', 'SignsDirections', 'Escalators' ,'InformationScreens', 'InformationBoothsLower', 'InformationBoothsUpper', 'SignsRoadways', 'ParkingFacility', 'AirTrain', 'LongTermParking','Airportrentalcarcenter', 'Overall')
# Compute variable means
colMeans(dfa2)
## Model Analysis - LCA
# Creates LCA formula
f <- cbind(ArtworkExhibitions, Restaurants, RetailShops, SignsDirections, Escalators, InformationScreens, InformationBoothsLower, InformationBoothsUpper, SignsRoadways, ParkingFacility, AirTrain, LongTermParking, Airportrentalcarcenter, Overall) ~ 1
# Performs LCA analysis
lca_classes3_dfa2 <- poLCA(f, dfa2, nclass = 3)
lca_classes4_dfa2 <- poLCA(f, dfa2, nclass = 4)
lca_classes5_dfa2 <- poLCA(f, dfa2, nclass = 5)
lca_classes6_dfa2 <- poLCA(f, dfa2, nclass = 6)
# Reports each lca model AIC values
rbind(class3_dfa2 = lca_classes3_dfa2$aic,
class4_dfa2 = lca_classes4_dfa2$aic,
class5_dfa2 = lca_classes5_dfa2$aic,
class6_dfa2 = lca_classes6_dfa2$aic)
# Reports lca model for class 6
lca_classes6_dfa2
plot(lca_classes6_dfa2)
# Reports lca model for class 3
lca_classes3_dfa2
plot(lca_classes3_dfa2)
## Model Analysis - Sentiment & Topic Model Analysis
# Define sentiment
sentiment <- dfa3 %>%
unnest_tokens(tbl = ., output = word, input = Q7_text_All) %>%
group_by(RESPNUM) %>%
inner_join(get_sentiments("bing")) %>%
count(sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
maxIdxs <- order(sentiment$sentiment, decreasing=TRUE)[1:3]
minIdxs <- order(sentiment$sentiment, decreasing=FALSE)[1:3]
# Sentiment Analysis based on airline terminal
termSentiment <- dfa3 %>%
unnest_tokens(tbl = ., output = word, input = Q7_text_All) %>%
group_by(TERM) %>%
inner_join(get_sentiments("bing")) %>%
count(sentiment) %>%
spread(sentiment, n, fill = 0) %>%
group_by(TERM) %>%
summarize(sentiment = mean(positive) - mean(negative))
termSentiment
# Sentiment Analysis based on scheduled flight time to leave
strataSentiment <- dfa3 %>%
unnest_tokens(tbl = ., output = word, input = Q7_text_All) %>%
group_by(STRATA) %>%
inner_join(get_sentiments("bing")) %>%
count(sentiment) %>%
spread(sentiment, n, fill = 0) %>%
group_by(STRATA) %>%
summarize(sentiment = mean(positive) - mean(negative))
strataSentiment
# Sentiment Analysis based on airline type
atypesentiment <- dfa3 %>%
unnest_tokens(tbl = ., output = word, input = Q7_text_All) %>%
group_by(ATYPE) %>%
inner_join(get_sentiments("bing")) %>%
count(sentiment) %>%
spread(sentiment, n, fill = 0) %>%
group_by(ATYPE) %>%
summarize(sentiment = mean(positive) - mean(negative))
atypesentiment
# Sentiment Analysis based on whether its a connecting flight
q1Sentiment <- dfa3 %>%
unnest_tokens(tbl = ., output = word, input = Q7_text_All) %>%
group_by(Q1) %>%
inner_join(get_sentiments("bing")) %>%
count(sentiment) %>%
spread(sentiment, n, fill = 0) %>%
group_by(Q1) %>%
summarize(sentiment = mean(positive) - mean(negative))
q1Sentiment %>% na.omit()
# Performs Topic Model Analysis
txt <- textProcessor(documents = dfa3$Q7_text_All, metadata = dfa3)
prep <- prepDocuments(documents = txt$documents, vocab = txt$vocab, meta = txt$meta)
# Identifies optimal K value of topics
kTest = searchK(documents = prep$documents, vocab = prep$vocab,
K = c(3, 4, 5, 10, 15, 20), verbose = FALSE)
plot(kTest)
# Computes Model
topic15 <- stm(documents = prep$documents, vocab = prep$vocab, K = 15, verbose = FALSE)
# Plots theme topic most used words
plot(topic15)
# Highlights theme topic most used words
labelTopics(topic15)
# Reports dispersion statistic
checkResiduals(topic15, documents = prep$documents)
# Computes Model
topic5 <- stm(documents = prep$documents, vocab = prep$vocab, K = 5, verbose = FALSE)
# Plots theme topic most used words
plot(topic5)
# Highlights theme topic most used words
labelTopics(topic5)
# Reports dispersion statistic
checkResiduals(topic5, documents = prep$documents)
# load libraries
library(tidyverse)
library(dplyr)
library(ggplot2)
library(tm)
library(wordcloud)
library(tidyr)
library(corrplot)
library(psych)
library(knitr)
library(factoextra)
## Get Data
# load data
data = read.table("SFO_survey_withText.txt", header = TRUE, sep = "", dec = ".")
# provides a glimpse of the data
glimpse(data)
## Data Pre-processing
# examines location of missing values
colSums(is.na(data))
# re-organize data.frame layout
data_tidy <- data %>% dplyr::select(RESPNUM, CCGID, Location.1, MAIL, LANG,
INTDATE, STRATA, TERM, GATENUM, DESTMARK, WEIGHT, ATYPE, AIRLINE, DEST, DESTGEO, COUNTRY, SFFLAG, Q1:Q15, Q17, Q18, Q19,
AGE, INC, DESTINATION, Q7_text_All)
# removal of variables that have over 50% of missing values & 4 meaningless variables
data_tidy_final <- data_tidy %>% dplyr::select(-c(RESPNUM, CCGID, Location.1, AGE, INC, Q3_5, Q3_6, Q8COM3, Q9ANEG3, Q9ANTR3, Q13_COM, Q13COM3, Q14A, Q2_5, Q2_6, Q3_4, Q2_3, Q2_4, Q3_3, Q7COM3, Q7A1, Q7A2, Q7A3, Q8COM1, Q8COM2, Q9APOS2, Q9APOS3, Q9ANEG1:Q9ANEG2, Q9ANTR1:Q9ANTR2, Q11A1:Q11A3, Q13COM2,Q14A2, Q14A3, Q7COM2, Q9APOS1, Q11, Q13COM1, Q14A1, Q7COM1))
dim(data_tidy_final)
# percentage of data removed
max(colSums(is.na(data_tidy_final)))/dim(data_tidy_final)[1]*100
## EDA
# unique(data$DESTINATION)
sort(table(data_tidy_final$DESTINATION))
barplot(sort(table(data_tidy_final$DESTINATION)))
# Creates wordcloud
cloud.df <- Corpus(DataframeSource(data.frame(doc_id = 1, text = data_tidy_final[, "Q7_text_All"])))
cloud.df <- tm_map(cloud.df, removePunctuation)
cloud.df <- tm_map(cloud.df, content_transformer(tolower))
cloud.df <- tm_map(cloud.df, removeWords, stopwords("english"))
tdm <- TermDocumentMatrix(cloud.df)
m <- as.matrix(tdm)
v <- sort(rowSums(m), decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
wordcloud(d$word,d$freq, scale=c(8,.3),min.freq=2,max.words=100, random.order=T, rot.per=.15, vfont=c("sans serif","plain"))
# Reports Descriptive Statistics of the data
Hmisc::describe(data_tidy_final)
# final dataset used for analysis
eda <- subset(data_tidy_final, select=-c(MAIL, LANG, DESTINATION, Q7_text_All))
avg <- na.aggregate(eda)
## Correlation Analysis
# examines entire dataset
corrplot(cor(avg))
# examines subset of datset
corrplot(cor(avg[1:12]))
corrplot(cor(avg[13:54]))
# reports the highest correlated variables
sub1 <- cor(avg[25:44])
round(sub1, 2)
round(cor(avg[13], avg[47]),2)
round(cor(avg[13], avg[51]),2)
## EFA
# Validates the performance for conducting an EFA
cortest.bartlett(avg)
KMO(avg)
# Identifies the appropriate number of PC and factors
scree(avg)
fa.parallel(avg)
# no rotation
fa1 <- fa(avg, nfactors = 20, rotate = "none", covar = FALSE)
print(fa1$loadings, cutoff = 0.001, digits = 3)
fa1$Phi
# orthogonal rotation
fa2 <- fa(avg, nfactors = 20, rotate = "varimax", covar = FALSE)
print(fa2$loadings, cutoff = 0.001, digits = 3)
fa2$Phi
# oblique rotation
fa3 <- fa(avg, nfactors = 20, rotate = "Promax")
print(fa3$loadings, cutoff = 0.001, digits = 3)
fa3$Phi #look at correlation between latent factors
## Explore (some) Hypothesis
# Examines tabulated data for potential group classes
data_tidy_final %>%
group_by(ATYPE) %>%
summarise(WithinCalifornia=sum(DEST == 1), OutOfState=sum(DEST == 2), OutOfCountry=sum(DEST == 3)) %>%
kable("html", escape = F, caption = "Travel Destinations by Airline Type", table.attr = "style='width:80%;'" )
# Performs cluster analysis
avg %>%
scale() %>%
fviz_nbclust(x = ., FUNcluster = kmeans, method = "gap")
avg %>%
scale() %>%
  fviz_nbclust(x = ., FUNcluster = cluster::clara, method = "gap")
#fitting random forest and making importance plots
fit_rand <- randomForest(factor(DEST)~.,avg)
varImpPlot(fit_rand)
avg_imp <- avg %>% dplyr::select(-c(GATENUM, WEIGHT, INTDATE, TERM, DESTGEO, ATYPE, AIRLINE))
fit_rand2 <- randomForest(factor(DEST)~.,avg_imp)
varImpPlot(fit_rand2)
#cleaning data for cluster analysis
clust_vars <- avg_imp %>% dplyr::select(COUNTRY, DESTMARK, STRATA, SFFLAG, Q12, Q4A, Q17, Q15, Q8A, Q3_1, Q6E, Q2_1, Q19, Q13B, Q6A, Q8F, Q6C, Q13A, Q6B, Q5AVG, Q5, Q8E, Q6F, DEST)
# Performs gap stat method
clust_vars %>%
scale() %>%
fviz_nbclust(x = ., FUNcluster = kmeans, method = "gap")
#Performs elbow method
clust_vars %>%
scale() %>%
fviz_nbclust(x = ., FUNcluster = kmeans, method = "wss")
#Performs silhouette method
clust_vars %>%
scale() %>%
fviz_nbclust(x = ., FUNcluster = kmeans, method = "silhouette")
#running k-means cluster model (avoid masking the base kmeans() function)
clust_vars_scaled <- clust_vars %>% scale()
km_fit <- kmeans(x = clust_vars_scaled, centers = 3)
fviz_cluster(km_fit, clust_vars_scaled)
#looking at centers of the k-means fit
km_fit$centers